
fcoll/vulcan accelerator support #12678

Conversation

edgargabriel (Member)

No description provided.

@edgargabriel changed the title from "Topic/fcoll vulcan accelerator support" to "fcoll/vulcan accelerator support" on Jul 13, 2024
@edgargabriel force-pushed the topic/fcoll-vulcan-accelerator-support branch from 74a3029 to 8b24867 on July 18, 2024 at 19:06
@edgargabriel force-pushed the topic/fcoll-vulcan-accelerator-support branch 2 times, most recently from 0beeef3 to f8bc3fd on August 12, 2024 at 21:04
edgargabriel and others added 3 commits September 3, 2024 06:26
If the user input buffers are GPU device memory, also use GPU device memory for the aggregation step. This allows the data transfers to occur between GPU buffers and hence take advantage of the much higher bandwidth of GPU-GPU interconnects (e.g. XGMI, NVLink).

The downside of this approach is that we cannot call the fbtl ipwritev routine directly, but have to go through the common_ompio_file_iwrite_pregen routine, which performs the necessary segmenting and staging through host memory.

Signed-off-by: Edgar Gabriel <[email protected]>
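
For illustration, a minimal, self-contained C sketch of the segmenting-and-staging idea described in this commit message. This is not the OMPIO code: `device_to_host_copy` stands in for an accelerator memcpy (e.g. hipMemcpy), and POSIX `pwrite` stands in for the fbtl write path.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define STAGING_SIZE (4 * 1024 * 1024)   /* size of the host bounce buffer */

/* Stand-in for an accelerator device-to-host copy (e.g. hipMemcpy). */
static void device_to_host_copy(void *host_dst, const void *dev_src, size_t len)
{
    memcpy(host_dst, dev_src, len);
}

/* Write 'len' bytes of a (device-resident) aggregation buffer to 'fd' starting
 * at the precomputed file offset 'file_off', segmenting the data and staging
 * each segment through host memory -- the basic idea behind the pregen path. */
static int staged_write(int fd, const void *dev_buf, size_t len, off_t file_off)
{
    char *staging = malloc(STAGING_SIZE);
    if (NULL == staging) {
        return -1;
    }
    for (size_t done = 0; done < len; ) {
        size_t chunk = (len - done < STAGING_SIZE) ? len - done : STAGING_SIZE;
        device_to_host_copy(staging, (const char *)dev_buf + done, chunk);
        if (pwrite(fd, staging, chunk, file_off + (off_t)done) != (ssize_t)chunk) {
            free(staging);
            return -1;
        }
        done += chunk;
    }
    free(staging);
    return 0;
}

int main(void)
{
    int fd = open("out.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        return 1;
    }
    size_t len = 16 * 1024 * 1024;
    char *buf = malloc(len);          /* pretend this is GPU device memory */
    if (NULL == buf) {
        close(fd);
        return 1;
    }
    memset(buf, 'x', len);
    int rc = staged_write(fd, buf, len, 0);
    free(buf);
    close(fd);
    return (0 == rc) ? 0 : 1;
}
```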
Add support for using accelerator buffers in the aggregation step of the read_all operation.
This lives in common/ompio rather than in the fcoll components, since all fcoll components (except individual)
currently use the default implementation, which was moved to common/ompio a while back to
avoid code duplication.

Signed-off-by: Edgar Gabriel <[email protected]>
Performance measurements indicate that in most cases using a CPU host
buffer for data aggregation leads to better performance than using a
GPU buffer, so turn the feature off by default.

Signed-off-by: Edgar Gabriel <[email protected]>
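
For illustration, a hypothetical sketch of how such a default-off switch is typically registered in Open MPI via mca_base_component_var_register. The variable name use_accelerator_buffers and the helper function below are made up for this example and may not match the parameter actually added by this PR; check ompi_info for the real name.

```c
#include "opal/mca/base/mca_base_var.h"

/* Hypothetical control variable: 0 = aggregate in host memory (default),
 * 1 = aggregate in accelerator memory. */
static int mca_common_ompio_use_accelerator_buffers = 0;

/* Illustrative registration helper; in a real component this would live in
 * the component's register/open function and use that component's descriptor. */
static void register_accelerator_aggregation_param(const mca_base_component_t *component)
{
    (void) mca_base_component_var_register(component,
                                           "use_accelerator_buffers",  /* hypothetical name */
                                           "If set to 1, use accelerator (GPU) memory for the "
                                           "data aggregation buffers in collective I/O. "
                                           "Off by default, since host buffers are usually faster.",
                                           MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
                                           OPAL_INFO_LVL_9,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &mca_common_ompio_use_accelerator_buffers);
}
```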
@edgargabriel force-pushed the topic/fcoll-vulcan-accelerator-support branch from f8bc3fd to d30471c on September 3, 2024 at 13:26
@qkoziol (Contributor) commented Sep 3, 2024

Looks fine to me.

I guess my only question is why the mca_common_ompio_file_iread_pregen / mca_common_ompio_file_iwrite_pregen routines are necessary, if the code was previously calling the preadv function.

@qkoziol (Contributor) commented Sep 3, 2024

Ah, this is for the ipreadv function.

So, why not use it for CPU memory buffers also?

@edgargabriel (Member, Author) commented Sep 4, 2024

@qkoziol thank you for your review! Let me try to answer your question, and also use this as an opportunity to document some of the changes. The pipeline protocol is used for individual I/O in cases where we need an additional staging buffer for the operation, e.g. for GPU buffers or when we need to perform data conversion for a different data representation. Regular file_read/write operations don't need this additional staging step.

When aggregating data into GPU buffers in collective I/O, we therefore cannot simply call the fbtl/ipreadv or fbtl/ipwritev function (as we do for host buffers), but have to invoke the pipeline protocol. However, in contrast to the individual I/O operations, some of its steps are not necessary: we can reuse the pre-calculated offsets from the collective I/O operation (and hence don't need to repeat the file-view processing), and we don't need to update the file pointer position (that is also done by the collective I/O operation). Hence the two iread_pregen/iwrite_pregen functions.
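
For illustration, a minimal, self-contained C sketch of the calling pattern just described. The types and function pointers are stand-ins rather than the actual OMPIO interfaces; the point is only that the aggregator already holds the (offset, length) segments from the collective phase, so this path neither re-evaluates the file view nor updates the individual file pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* One aggregation segment as pre-calculated by the collective phase:
 * the file offset and length are already known at this point. */
typedef struct {
    off_t  offset;
    size_t length;
} agg_segment_t;

/* Illustrative stand-ins for the two write paths: a direct (fbtl-style) write
 * from a host buffer, and a staged (pregen-style) write that pipes a device
 * buffer through a host bounce buffer segment by segment. */
typedef int (*direct_write_fn_t)(int fd, const void *host_buf, off_t off, size_t len);
typedef int (*staged_write_fn_t)(int fd, const void *dev_buf, off_t off, size_t len);

/* Flush the aggregation buffer using the pre-calculated segments.  Because the
 * offsets come from the collective phase, there is no file-view processing and
 * no individual-file-pointer update in this path. */
static int flush_aggregation_buffer(int fd, const void *agg_buf, bool buf_on_device,
                                    const agg_segment_t *segs, size_t nsegs,
                                    direct_write_fn_t direct_write,
                                    staged_write_fn_t staged_write)
{
    size_t consumed = 0;              /* position inside the aggregation buffer */
    for (size_t i = 0; i < nsegs; i++) {
        const void *src = (const char *)agg_buf + consumed;
        int rc = buf_on_device
                     ? staged_write(fd, src, segs[i].offset, segs[i].length)
                     : direct_write(fd, src, segs[i].offset, segs[i].length);
        if (0 != rc) {
            return rc;
        }
        consumed += segs[i].length;
    }
    return 0;
}
```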

Lastly, each collective component has its own write_all operation, but they all use the same algorithm for read_all, which is why it was moved from the components into the common/ompio directory. This may have to change in the near future, but our focus in the past has always been on the write_all operations, and read_all was neglected a bit.

@edgargabriel merged commit 1afb524 into open-mpi:main on Sep 4, 2024
14 checks passed